[enhancement](parquet)support column predicate tree min-max filter for parquet page index. #57771
run buildall |
TPC-DS: Total hot run time: 190261 ms |
ClickBench: Total hot run time: 27.44 s |
1988815 to 430ea3c
|
run buildall |
TPC-DS: Total hot run time: 189821 ms |
ClickBench: Total hot run time: 27.69 s |
|
run buildall |
TPC-H: Total hot run time: 34442 ms |
TPC-DS: Total hot run time: 188197 ms |
ClickBench: Total hot run time: 27.22 s |
|
run buildall |
TPC-H: Total hot run time: 34480 ms |
TPC-DS: Total hot run time: 187984 ms |
ClickBench: Total hot run time: 27.38 s |
BE UT Coverage Report: Increment line coverage, Increment coverage report
|
BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report
|
Pull Request Overview
This PR refactors the Parquet page index filtering implementation by improving the handling of OR predicates and removing the colname_to_value_range parameter from various reader initialization methods. The changes optimize page index filtering to better support complex predicate combinations.
Key Changes:
- Refactored page index filtering logic to support OR predicates with a new `evaluate_and` method that works with `CachedPageIndexStat` and `RowRanges`
- Removed the `colname_to_value_range` parameter from all reader `init_reader` methods (ORC, Parquet, Iceberg, Paimon, Hudi, Hive, etc.)
- Introduced `RowRanges` as a unified structure for representing row ranges to read, replacing the previous `vector<RowRange>` approach
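To make the idea of page-level min-max filtering concrete, here is a minimal sketch of how one comparison predicate could turn per-page statistics into row ranges. `PageStat`, `RowRange`, and `pages_matching_greater_than` are hypothetical, simplified stand-ins written for illustration; they are not the `PageIndexStat`, `RowRanges`, or `evaluate_and` definitions from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics (illustration only, not the Doris types).
struct PageStat {
    int64_t first_row;  // index of the first row stored in the page
    int64_t num_rows;   // number of rows stored in the page
    int64_t min_value;  // minimum column value within the page
    int64_t max_value;  // maximum column value within the page
};

// Half-open row interval [from, to).
struct RowRange {
    int64_t from;
    int64_t to;
};

// Returns the row ranges of all pages that may contain a value > bound.
// A page is skipped only when its max value proves that no row can match.
// Adjacent surviving pages are merged into one range.
std::vector<RowRange> pages_matching_greater_than(const std::vector<PageStat>& pages,
                                                  int64_t bound) {
    std::vector<RowRange> out;
    for (const PageStat& p : pages) {
        assert(p.num_rows >= 0);
        if (p.max_value <= bound) {
            continue;  // every value in this page is <= bound
        }
        const int64_t from = p.first_row;
        const int64_t to = p.first_row + p.num_rows;
        if (!out.empty() && out.back().to == from) {
            out.back().to = to;  // extend the previous range
        } else {
            out.push_back({from, to});
        }
    }
    return out;
}
```

With four 100-row pages whose max values are 10, 20, 30, and 8, a predicate `col > 15` would keep only the second and third pages, so the reader would need to decode rows [100, 300) rather than the whole row group.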
Reviewed Changes
Copilot reviewed 38 out of 40 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| regression-test/suites/external_table_p0/hive/test_hive_page_index.groovy | New test suite for Hive page index filtering with various predicate combinations |
| docker/thirdparties/docker-compose/hive/scripts/preinstalled_data/parquet_table/decimals_1_10/decimals_1_10.parquet | Binary test data file for decimal column tests |
| docker/thirdparties/docker-compose/hive/scripts/create_preinstalled_scripts/run82.hql | HQL script to create test table for decimals |
| be/src/vec/exec/format/parquet/vparquet_reader.{h,cpp} | Major refactoring of page index filtering, min-max-bloom filter processing, and row group iteration logic |
| be/src/vec/exec/format/parquet/vparquet_group_reader.{h,cpp} | Updated to use RowRanges instead of vector<RowRange> |
| be/src/vec/exec/format/parquet/vparquet_column_reader.{h,cpp} | Updated column readers to work with RowRanges |
| be/src/vec/exec/format/parquet/parquet_predicate.h | Added PageIndexStat and CachedPageIndexStat structures, renamed get_min_max_value to parse_min_max_value |
| be/src/vec/exec/format/parquet/vparquet_page_index.{h,cpp} | Removed unused create_skipped_row_range method, made parse methods const |
| be/src/vec/exec/format/parquet/parquet_common.h | Replaced custom RowRange with segment_v2::RowRange and RowRanges |
| be/src/vec/exec/format/orc/vorc_reader.{h,cpp} | Removed colname_to_value_range parameter from init_reader |
| be/src/vec/exec/format/table/*.{h,cpp} | Updated table format readers (Iceberg, Paimon, Hudi, Hive, TransactionalHive) to remove colname_to_value_range parameter |
| be/src/vec/exec/scan/file_scanner.cpp | Updated all reader initialization calls to remove colname_to_value_range |
| be/src/olap/rowset/segment_v2/row_ranges.h | Added get_range method to RowRanges |
| be/src/olap/column_predicate.h | Added new evaluate_and method signature for page index filtering |
| be/src/olap/block_column_predicate.{h,cpp} | Implemented page index filtering for AND/OR block predicates |
| be/src/olap/comparison_predicate.h | Added page index filtering support for comparison predicates |
| be/src/olap/in_list_predicate.h | Added page index filtering support for IN list predicates |
| be/src/olap/null_predicate.h | Added page index filtering support for NULL predicates |
| be/src/olap/push_handler.{h,cpp} | Removed unused colname_to_value_range member |
| be/test/vec/exec/*.cpp | Updated test code to remove colname_to_value_range parameter, removed obsolete test methods |
| std::function<bool(PageIndexStat**, int)> get_stat_func; | ||
| }; | ||
|
|
||
| // The encoded Parquet min-max value is parsed into `fields`; |
Copilot
AI
Nov 10, 2025
The comment "The encoded Parquet min-max value is parsed into fields" has an extra space between "into" and the backtick. It should be a single space.
| // The encoded Parquet min-max value is parsed into `fields`; | |
| // The encoded Parquet min-max value is parsed into `fields`; |
|
|
||
| String enabled = context.config.otherConfigs.get("enableHiveTest") | ||
| if (enabled == null || !enabled.equalsIgnoreCase("true")) { | ||
| logger.info("diable Hive test.") |
Copilot
AI
Nov 10, 2025
The spelling "diable" should be "disable".
| order_qt_q33 """ select * from decimals_1_10 where d_1 is null or d_10 is null ; """ | ||
| order_qt_q33 """ select * from decimals_1_10 where d_1 is null""" | ||
| order_qt_q33 """ select * from decimals_1_10 where d_10 is null ; """ |
Copilot
AI
Nov 10, 2025
Duplicate query identifiers detected. Lines 84, 85, and 86 all use the same identifier order_qt_q33, but they are executing different queries. Each query should have a unique identifier, such as order_qt_q34 and order_qt_q35 for the second and third queries respectively.
| RowRanges* candidate_row_ranges); | ||
|
|
||
| // Row Group Filter | ||
| // check this range contain this tow group. |
Copilot
AI
Nov 10, 2025
The comment "check this range contain this tow group" has a spelling error. "tow" should be "row".
| // check this range contain this tow group. | |
| // check this range contain this row group. |
|
run buildall |
TPC-H: Total hot run time: 34203 ms |
TPC-DS: Total hot run time: 187676 ms |
ClickBench: Total hot run time: 27.54 s |
BE UT Coverage Report: Increment line coverage, Increment coverage report
|
BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
…ition.columns prop table cause be core. (#58532)

### What problem does this PR solve?

Related PR: #57771

Problem Summary: Fixed a BE core dump when reading Hudi Parquet format tables whose `hoodie.properties` sets `hoodie.datasource.write.drop.partition.columns=false`.

```
*** SIGSEGV address not mapped to object (@0x18) received by PID 12234 (TID 38368 OR 0x7f0bd279e640) from PID 24; stack trace: ***
11:01:31 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:420
11:01:31 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
11:01:31 2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
11:01:31 3# 0x00007F18963FB520 in /lib/x86_64-linux-gnu/libc.so.6
11:01:31 4# std::_Function_handler<bool (doris::vectorized::ParquetPredicate::PageIndexStat**, int), doris::vectorized::ParquetReader::_process_page_index_filter(tparquet::RowGroup const&, doris::vectorized::RowGroupReader::RowGroupIndex const&, std::vector<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> >, std::allocator<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> > > > const&, doris::segment_v2::RowRanges*)::$_1>::_M_invoke(std::_Any_data const&, doris::vectorized::ParquetPredicate::PageIndexStat**&&, int&&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
11:01:31 5# doris::InListPredicateBase<(doris::PrimitiveType)2, (doris::PredicateType)7, doris::HybridSet<(doris::PrimitiveType)2, doris::FixedContainer<bool, 1ul>, doris::vectorized::PredicateColumnType<(doris::PrimitiveType)2> > >::evaluate_and(doris::vectorized::ParquetPredicate::CachedPageIndexStat*, doris::segment_v2::RowRanges*) const at /home/zcp/repo_center/doris_master/doris/be/src/olap/in_list_predicate.h:345
11:01:31 6# doris::AndBlockColumnPredicate::evaluate_and(doris::vectorized::ParquetPredicate::CachedPageIndexStat*, doris::segment_v2::RowRanges*) const at /home/zcp/repo_center/doris_master/doris/be/src/olap/block_column_predicate.cpp:148
11:01:31 7# doris::vectorized::ParquetReader::_process_page_index_filter(tparquet::RowGroup const&, doris::vectorized::RowGroupReader::RowGroupIndex const&, std::vector<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> >, std::allocator<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> > > > const&, doris::segment_v2::RowRanges*) in /mnt/hdd01/ci/doris-deploy-master-local/be/lib/doris_be
11:01:31 8# doris::vectorized::ParquetReader::_process_min_max_bloom_filter(doris::vectorized::RowGroupReader::RowGroupIndex const&, tparquet::RowGroup const&, std::vector<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> >, std::allocator<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> > > > const&, doris::segment_v2::RowRanges*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:1082
11:01:31 9# doris::vectorized::ParquetReader::_next_row_group_reader() in /mnt/hdd01/ci/doris-deploy-master-local/be/lib/doris_be
11:01:31 10# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:598
11:01:31 11# doris::vectorized::HudiReader::get_next_block_inner(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/table/hudi_reader.cpp:29
11:01:31 12# doris::vectorized::TableFormatReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/table/table_format_reader.h:82
11:01:31 13# doris::vectorized::FileScanner::_get_block_wrapped(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/file_scanner.cpp:472
```
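The trace shows the crash inside the `std::function<bool(PageIndexStat**, int)>` callback invoked from `InListPredicateBase::evaluate_and`. The source does not describe what the actual fix was, so the following is only a sketch of one defensive pattern for such a callback, under the assumption that statistics may be unavailable for some column. Only the callback signature and the names `PageIndexStat`, `CachedPageIndexStat`, and `get_stat_func` come from the page; the struct contents, `try_get_stat`, and `make_cache` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>

// Hypothetical contents; the real struct is defined in parquet_predicate.h.
struct PageIndexStat {
    bool available = false;
};

// Hypothetical layout; only the callback signature is taken from the PR.
struct CachedPageIndexStat {
    std::function<bool(PageIndexStat**, int)> get_stat_func;
};

// Returns true only when a usable stat was found for column_id.
// A false result means "cannot filter on this column, keep all rows".
bool try_get_stat(const CachedPageIndexStat& cache, int column_id, PageIndexStat** out) {
    assert(out != nullptr);
    if (!cache.get_stat_func) {
        return false;  // no callback installed
    }
    PageIndexStat* stat = nullptr;
    if (!cache.get_stat_func(&stat, column_id) || stat == nullptr) {
        return false;  // lookup failed or produced no stat
    }
    *out = stat;
    return true;
}

// Builds a cache whose callback reports failure for unknown columns
// instead of handing back an invalid pointer.
CachedPageIndexStat make_cache(std::unordered_map<int, PageIndexStat>& stats_by_column) {
    CachedPageIndexStat cache;
    cache.get_stat_func = [&stats_by_column](PageIndexStat** out, int column_id) {
        auto it = stats_by_column.find(column_id);
        if (it == stats_by_column.end()) {
            return false;  // no statistics for this column
        }
        *out = &it->second;
        return true;
    };
    return cache;
}
```

The design choice here is that a missing statistic degrades to "no pruning" rather than to a dereference of an unset pointer; whether the real fix works this way is not stated in the source.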
…r parquet page index. (apache#57771)

Problem Summary:
1. The previous page index filter could only handle SQL WHERE conditions made up solely of AND; this PR also handles conditions that contain OR.
2. Because the topn runtime filter is dynamically maintained, this PR delays applying the topn RF min-max filter until the row group reader is created.
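The difference between AND-only and AND/OR filtering can be sketched with plain interval arithmetic: an AND node intersects the row ranges produced by its children, while an OR node has to take their union. The code below is an illustrative sketch only; `Range`, `Ranges`, `intersect`, and `unite` are hypothetical names and not the `RowRanges` API used in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open row interval [from, to).
struct Range {
    int64_t from;
    int64_t to;
};

// Assumed invariant: sorted by `from` and non-overlapping.
using Ranges = std::vector<Range>;

// AND: rows must survive both children, so keep only the overlap.
Ranges intersect(const Ranges& a, const Ranges& b) {
    Ranges out;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < a.size() && j < b.size()) {
        const int64_t lo = std::max(a[i].from, b[j].from);
        const int64_t hi = std::min(a[i].to, b[j].to);
        if (lo < hi) {
            out.push_back({lo, hi});
        }
        // Advance whichever range ends first.
        if (a[i].to < b[j].to) {
            ++i;
        } else {
            ++j;
        }
    }
    return out;
}

// OR: rows may survive either child, so merge everything that touches.
Ranges unite(const Ranges& a, const Ranges& b) {
    Ranges all(a);
    all.insert(all.end(), b.begin(), b.end());
    std::sort(all.begin(), all.end(),
              [](const Range& x, const Range& y) { return x.from < y.from; });
    Ranges out;
    for (const Range& r : all) {
        assert(r.from <= r.to);
        if (!out.empty() && r.from <= out.back().to) {
            out.back().to = std::max(out.back().to, r.to);
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```

For a condition shaped like `(a > 10 AND b < 5) OR c IS NULL`, the candidate rows would be `unite(intersect(ranges_a, ranges_b), ranges_c)`. An AND-only implementation can simply shrink one candidate set predicate by predicate, which is why it cannot express the OR case.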
…x filter for parquet page index. (#57771) (#59680)

bp #57771

### What problem does this PR solve?

Problem Summary:
1. The previous page index could only handle SQL WHERE conditions that only contained AND, but this PR can handle conditions that contain OR.
2. Because the topn runtime filter is dynamically maintained, this PR delays the timing of the topn RF min-max filter until the row group reader is created.
What problem does this PR solve?
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)